xadupre (Member) commented Sep 1, 2025

### Description

Completes the implementation of MatMul and Gemm for float16 on CPU.

### Motivation and Context

See issue #25824. A benchmark should validate the change, because float32 is usually faster than float16 on CPU. Right now, optimizers insert Cast nodes so the graph can run with the float32 kernels, and that still seems to be the best approach. The benchmark needs to be run on other processors to see when this kernel should be added.
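
For illustration, here is a sketch of that cast pattern, written by hand as a hypothetical equivalent graph (not the optimizer's actual output): the float16 inputs are cast up to float32, the MatMul runs with the float32 kernel, and the result is cast back down.

```python
import onnx
import onnx.helper as oh

# Hypothetical hand-written equivalent of the cast fallback:
# float16 -> Cast(float32) -> MatMul -> Cast(float16).
model = oh.make_model(
    oh.make_graph(
        [
            oh.make_node("Cast", ["X"], ["X32"], to=onnx.TensorProto.FLOAT),
            oh.make_node("Cast", ["Y"], ["Y32"], to=onnx.TensorProto.FLOAT),
            oh.make_node("MatMul", ["X32", "Y32"], ["Z32"]),
            oh.make_node("Cast", ["Z32"], ["Z"], to=onnx.TensorProto.FLOAT16),
        ],
        "cast_fallback",
        [
            oh.make_tensor_value_info("X", onnx.TensorProto.FLOAT16, ["a", "a"]),
            oh.make_tensor_value_info("Y", onnx.TensorProto.FLOAT16, ["a", "a"]),
        ],
        [oh.make_tensor_value_info("Z", onnx.TensorProto.FLOAT16, ["a", "a"])],
    ),
    ir_version=10,
    opset_imports=[oh.make_opsetid("", 18)],
)
```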

| size | float32 (s per iteration) | float16 (s per iteration) |
|------|---------------------------|---------------------------|
| 256  | 0.00023763089993735775    | 0.01750569239993638       |
| 512  | 0.0013353320497117237     | 0.11820917490003921       |
| 1024 | 0.013059543249983107      | 0.9459635703999083        |
<details>
<summary>benchmark code</summary>

```python
import time

import numpy as np
import onnx
import onnx.helper as oh
import onnxruntime


def model_type(itype):
    # A single MatMul on square inputs sharing the dynamic dimension "a".
    return oh.make_model(
        oh.make_graph(
            [oh.make_node("MatMul", ["X", "Y"], ["Z"])],
            "b",
            [
                oh.make_tensor_value_info("X", itype, ["a", "a"]),
                oh.make_tensor_value_info("Y", itype, ["a", "a"]),
            ],
            [oh.make_tensor_value_info("Z", itype, ["a", "a"])],
        ),
        ir_version=10,
        opset_imports=[oh.make_opsetid("", 18)],
    )


sess16 = onnxruntime.InferenceSession(
    model_type(onnx.TensorProto.FLOAT16).SerializeToString(),
    providers=["CPUExecutionProvider"],
)
sess32 = onnxruntime.InferenceSession(
    model_type(onnx.TensorProto.FLOAT).SerializeToString(),
    providers=["CPUExecutionProvider"],
)

N = 20
for size in [256, 512, 1024]:

    # float32

    f32 = dict(
        X=np.random.randn(size, size).astype(np.float32),
        Y=np.random.randn(size, size).astype(np.float32),
    )
    # warmup
    for i in range(10):
        sess32.run(None, f32)

    # measure
    begin = time.perf_counter()
    for i in range(N):
        sess32.run(None, f32)
    duration = time.perf_counter() - begin
    print(f"float32 size={size}, time={duration / N}s per iteration")

    # float16

    f16 = {k: v.astype(np.float16) for k, v in f32.items()}
    # warmup
    for i in range(10):
        sess16.run(None, f16)

    # measure
    begin = time.perf_counter()
    for i in range(N):
        sess16.run(None, f16)
    duration = time.perf_counter() - begin
    print(f"float16 size={size}, time={duration / N}s per iteration")
```

</details>
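
The benchmark above only exercises MatMul. A minimal smoke test for the Gemm path could look like the sketch below (a hypothetical check, not part of this PR; the reference is computed in float32 and the tolerances are loose because float16 accumulation differs between implementations):

```python
import numpy as np
import onnx
import onnx.helper as oh
import onnxruntime

# Single Gemm node, all inputs and the output in float16.
model = oh.make_model(
    oh.make_graph(
        [oh.make_node("Gemm", ["A", "B", "C"], ["Y"])],
        "g",
        [
            oh.make_tensor_value_info("A", onnx.TensorProto.FLOAT16, [4, 8]),
            oh.make_tensor_value_info("B", onnx.TensorProto.FLOAT16, [8, 3]),
            oh.make_tensor_value_info("C", onnx.TensorProto.FLOAT16, [4, 3]),
        ],
        [oh.make_tensor_value_info("Y", onnx.TensorProto.FLOAT16, [4, 3])],
    ),
    ir_version=10,
    opset_imports=[oh.make_opsetid("", 18)],
)

sess = onnxruntime.InferenceSession(
    model.SerializeToString(), providers=["CPUExecutionProvider"]
)
feeds = {
    "A": np.random.randn(4, 8).astype(np.float16),
    "B": np.random.randn(8, 3).astype(np.float16),
    "C": np.random.randn(4, 3).astype(np.float16),
}
(got,) = sess.run(None, feeds)
# Reference computed in float32, then cast back to float16.
expected = (
    feeds["A"].astype(np.float32) @ feeds["B"].astype(np.float32)
    + feeds["C"].astype(np.float32)
).astype(np.float16)
np.testing.assert_allclose(got, expected, rtol=5e-2, atol=5e-2)
```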
